
Estimating the number of unseen variants in the human genome


Abstract

The different genetic variation discovery projects (The SNP Consortium, the International HapMap Project, the 1000 Genomes Project, etc.) aim to identify as much as possible of the underlying genetic variation in various human populations. The question we address in this article is how many new variants are yet to be found. This is an instance of the species problem in ecology, where the goal is to estimate the number of species in a closed population. We use a parametric beta-binomial model that allows us to calculate the expected number of new variants with a desired minimum frequency to be discovered in a new dataset of individuals of a specified size. The method can also be used to predict the number of individuals that must be sequenced in order to capture all (or a fraction of) the variation with a specified minimum frequency. We apply the method to three datasets: the ENCODE dataset, the SeattleSNPs dataset, and the National Institute of Environmental Health Sciences SNPs dataset. Consistent with previous descriptions, our results show that the African population is the most diverse in terms of the number of variants expected to exist and the Asian populations the least diverse, with the European population in between. In addition, our results show a clear distinction between the Chinese and the Japanese populations, with the Japanese population being the less diverse. To find all common variants (frequency at least 1%), the number of individuals that need to be sequenced is small (∼350) and does not differ much among the different populations; our data show that, subject to sequence accuracy, the 1000 Genomes Project is likely to find most of these common variants and a high proportion of the rarer ones (frequency between 0.1 and 1%).
The data reveal a rule of diminishing returns: a small number of individuals (∼150) is sufficient to identify 80% of variants with a frequency of at least 0.1%, while a much larger number (>3,000 individuals) is necessary to find all of those variants. Finally, our results also show a much higher diversity in environmental response genes compared with the average genome, especially in African populations.
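
The sampling argument behind sample sizes like these can be illustrated with a small calculation. The sketch below is not the authors' fitted model. It assumes diploid individuals, treats a variant as "found" when at least one copy appears among the 2n sampled chromosomes, and ignores sequencing error. The function names and the Beta(a, b) parameters are hypothetical, chosen only for illustration. For a variant at a fixed frequency f, the chance of missing it is (1 - f)^(2n). If f instead follows a Beta(a, b) distribution, the chance of missing it is the beta-binomial probability of zero copies, B(a, b + 2n) / B(a, b).

```python
import math


def detect_prob_fixed(f, n_individuals, ploidy=2):
    """P(at least one copy seen) for a variant at fixed frequency f.

    Assumption: chromosomes are sampled independently and without error.
    """
    m = ploidy * n_individuals  # number of sampled chromosomes
    return 1.0 - (1.0 - f) ** m


def log_beta(a, b):
    """Natural log of the Beta function B(a, b), via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def detect_prob_beta(a, b, n_individuals, ploidy=2):
    """P(at least one copy seen) when f ~ Beta(a, b) (hypothetical prior).

    Uses E[(1 - f)^m] = B(a, b + m) / B(a, b), the beta-binomial
    probability of observing zero copies in m chromosomes.
    """
    m = ploidy * n_individuals
    return 1.0 - math.exp(log_beta(a, b + m) - log_beta(a, b))


def individuals_needed(f, target, ploidy=2):
    """Smallest n such that a variant at frequency f is seen with
    probability at least `target` (fixed-frequency assumption)."""
    chromosomes = math.log(1.0 - target) / math.log(1.0 - f)
    return math.ceil(chromosomes / ploidy)


if __name__ == "__main__":
    print(detect_prob_fixed(0.01, 350))
    print(individuals_needed(0.01, 0.999))
    print(detect_prob_beta(1.0, 1.0, 1))
```

Under these simplifying assumptions, a variant at exactly 1% frequency is seen with probability above 99.9% in 350 individuals, which is the same order of magnitude as the figure quoted in the abstract. The paper's own estimates come from a beta-binomial model fitted to data, so its numbers need not match this simplified calculation.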
